The document contains a detailed analysis of forecast sale expectations of High-End Vacuums for Vac-Attack company in December 2020 using historical data. This analysis aims to determine if the company will meet the target and ensure the stock is on hand to match the demand.
The report describes data cleaning, summary statistics, visualisation and analysis using the statistical model fitting, implemented with R programming language.
if (!require(tidyverse))
install.packages("tidyverse")
if (!require(lubridate))
install.packages("lubridate")
if (!require(plotly))
install.packages("plotly")
if (!require(car))
install.packages("stargazer")
if (!require(xgboost))
install.packages("xgboost")
library(tidyverse)
library(lubridate)
library(stargazer)
library(xgboost)First import the Market data.
market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)
market_sale_raw <- readr::read_csv("Data/MarketingSales.csv",
col_names = market_sale_col$X1)Then import the December data for prediction.
dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)
dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv",
col_names = dec_col$X1)Quick explore of the read-in the data sets
dim(market_sale_raw)## [1] 1796 11
glimpse(market_sale_raw)## Rows: 1,796
## Columns: 11
## $ Date <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month <chr> "January", "January", "January", "January", "Ja…
## $ Day <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)## [1] 31 4
glimpse(dec_ad_raw)## Rows: 31
## Columns: 4
## $ Date <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month <chr> "December", "December", "December", "December", "Dece…
## $ Day <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…
The next step is to tidy up the data sets for visualisation and
modelling. I have included year, month and day variables in both
factor and numeric formats.
market_sale_final <-
market_sale_raw %>%
mutate(Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(`0508Line_247`),
Positive_News = factor(PositiveNews),
Negative_News = factor(NegativeCoverage),
Competition = factor(Competition),
Ultra_Edition = factor( UltraEdition_Available),
COVID_Lockdown = factor(COVID_Lockdown),
Year = factor(year(Date)),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Month_num = as.numeric(Month),
Year_num = year(Date),
Day_num = as.numeric(Day),
Date_num = date(Date)) %>%
select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition, COVID_Lockdown,
Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(market_sale_final)## [1] 1796 16
glimpse(market_sale_final)## Rows: 1,796
## Columns: 16
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month <fct> January, January, January, January, January, January, J…
## $ Day <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
The following dataset is used to predict the total units sold in December 2020. Thus, this dataset contains the advertising spent without the expectation of competition or news stories. However, the Ultra Vac will continue to be sold.
dec_ad_final <-
dec_ad_raw %>%
mutate(
Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(0, levels = c(0, 1)),
Positive_News = factor(0, levels = c(0, 1)),
Negative_News = factor(0, levels = c(0, 1)),
Competition = factor(0, levels = c(0, 1)),
Ultra_Edition = factor(1, levels = c(0, 1)),
COVID_Lockdown = factor(0, levels = c(0, 1)),
Year = factor(2020, levels = market_sale_final$Year |> levels() ),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Year_num = 2020,
Month_num = 12,
Day_num = as.numeric(Day),
Date_num = date(Date)
) %>%
select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition,
COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(dec_ad_final)## [1] 31 15
glimpse(dec_ad_final)## Rows: 31
## Columns: 15
## $ Ad_Spend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month <fct> December, December, December, December, December, Decem…
## $ Day <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
The first step is to generate some plots to look for any relationships and outliers.
The first plot is a line plot on total unit sold versus time.
ggplot(market_sale_final, aes(x = Date, y = Sales)) +
geom_path() +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()There is clearing some seasonality on the total unit sold versus time for adjustment, thus, Year and Month variables might be important variables for the modelling fitting.
The second plot examines the relationship between the total unit sold versus the total advertising spend.
ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
geom_point()There is a weak positive correlation between the Advertising Spend and total unit sold, with the Pearson correlation of 41.1%.
The plot is to see if the Advertising Spend logarithm transform can improve the linear relationship to the total unit sold.
ggplot(market_sale_final, aes(x = log(Ad_Spend +1), y = Sales)) +
geom_point()The correlation between the Advertising Spend and total unit sold is still weak, with the Pearson correlation becomes worse of 30.4%. Thus, I will not logarithm transform the total Advertising Spend values.
The following plot below is a line plot on total sold units and advertising spend across time.
ggplot(market_sale_final, aes(x = Date)) +
geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) +
geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) +
scale_y_continuous(
name = "The units sold",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light() +
theme(
axis.title.y = element_text(color = "red", size=13),
axis.title.y.right = element_text(color = "blue", size=13)
)Again, there is some weak relationship between the total sold units and advertising spend across time, but modelling will require further confirm its relationship.
The following plot is to examine the distribution of the total units sold, to look for any possible skewness, which might require other data transformation.
hist(market_sale_final$Sales)The total number of Sales is looking reasonably normal, i.e. bell-shaped; thus there is no need to apply any additional normalisation methods.
The following set of tables contains some summary statistics to allow me to have some initial inspection on the total units sold against of the variables of the datasets.
Total units sold versus years.
market_sale_final %>%
group_by(Year) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()| Year | Sales_mean | Sales_sum |
|---|---|---|
| 2016 | 89.48087 | 32750 |
| 2017 | 82.02740 | 29940 |
| 2018 | 76.81370 | 28037 |
| 2019 | 78.07671 | 28498 |
| 2020 | 66.04179 | 22124 |
The total units sold appears to decline from year to year.
Total units sold versus months
market_sale_final %>%
group_by(Month) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| Month | Sales_mean | Sales_sum |
|---|---|---|
| January | 66.17419 | 10257 |
| February | 65.78169 | 9341 |
| March | 64.52258 | 10001 |
| April | 59.66000 | 8949 |
| May | 65.76129 | 10193 |
| June | 71.71333 | 10757 |
| July | 71.52258 | 11086 |
| August | 80.06452 | 12410 |
| September | 88.28000 | 13242 |
| October | 91.63226 | 14203 |
| November | 107.91333 | 16187 |
| December | 118.73387 | 14723 |
The most of total units sold appears to be at the later months of the year.
Total units sold versus days of the week
market_sale_final %>%
group_by(Day) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| Day | Sales_mean | Sales_sum |
|---|---|---|
| Monday | 78.26848 | 20115 |
| Tuesday | 78.85156 | 20186 |
| Wednesday | 77.59766 | 19865 |
| Thursday | 78.82422 | 20179 |
| Friday | 79.33852 | 20390 |
| Saturday | 78.82101 | 20257 |
| Sunday | 79.21012 | 20357 |
The total units sold appears to very similar between the days of each week.
Total units sold versus having the Ultra Edition.
market_sale_final %>%
group_by(Ultra_Edition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| Ultra_Edition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 81.28650 | 74052 |
| 1 | 76.04181 | 67297 |
The total units sold appears to be similar with or without the Ultra Edition.
Total units sold versus having the Positive news.
market_sale_final %>%
group_by(Positive_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()| Positive_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.22094 | 135948 |
| 1 | 93.12069 | 5401 |
The total units sold is higher with the Positive news.
Total units sold versus having the Negative news.
market_sale_final %>%
group_by(Negative_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()| Negative_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.85618 | 140364 |
| 1 | 61.56250 | 985 |
The total units sold is higher without the Negative news.
Total units sold versus having the Competition
market_sale_final %>%
group_by(Competition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| Competition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.09143 | 111247 |
| 1 | 73.96069 | 30102 |
The total units sold appears to be similar with or without the Competition.
Total units sold versus having the COVID Lockdown
market_sale_final %>%
group_by(COVID_Lockdown) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| COVID_Lockdown | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.24484 | 139947 |
| 1 | 26.96154 | 1402 |
The total units sold is significant higher without the COVID Lockdown.
Total units sold versus having the 24/7 call line.
market_sale_final %>%
group_by(Phone_24) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()| Phone_24 | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 76.10649 | 95057 |
| 1 | 84.62888 | 46292 |
The total units sold appears to be higher when there is 24/7 call line available.
Since the distribution of the total units sold is relatively normal with a belt-shaped curve, I will start by fitting a multiple linear regression model for the total units sold, as seen below. Using a multiple linear regression model is a good starting point, because it is relatively straightforward to fit the model and perform model diagnostics to decide whether a more complicated modelling technique is required. Further, linear regression analysis makes inference easy, i.e. interpreting which predictor variable has statistically significant effects or has the most influence on the target variable. Lastly, linear regression analysis can be used to make predictions about the value of the target variable based on the values of the predictor variables.
linear_reg_fit <-
lm(
Sales ~ Ad_Spend + (Year_num * Month) / Day +
COVID_Lockdown + Ultra_Edition + Competition + Phone_24 +
Positive_News + Negative_News,
data = market_sale_final
)ANOVA table of the initial model
anova(linear_reg_fit)## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Ad_Spend 1 151427 151427 4567.3059 < 2.2e-16 ***
## Year_num 1 84876 84876 2560.0098 < 2.2e-16 ***
## Month 11 520587 47326 1427.4418 < 2.2e-16 ***
## COVID_Lockdown 1 43105 43105 1300.1339 < 2.2e-16 ***
## Ultra_Edition 1 1042 1042 31.4209 2.421e-08 ***
## Competition 1 5867 5867 176.9491 < 2.2e-16 ***
## Phone_24 1 12119 12119 365.5325 < 2.2e-16 ***
## Positive_News 1 9669 9669 291.6337 < 2.2e-16 ***
## Negative_News 1 7207 7207 217.3837 < 2.2e-16 ***
## Year_num:Month 11 3837 349 10.5219 < 2.2e-16 ***
## Year_num:Month:Day 72 1698 24 0.7114 0.9677
## Residuals 1693 56130 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Choose a model by AIC in a Stepwise Algorithm
linear_reg_fit_final <- step(linear_reg_fit) ## Start: AIC=6388.04
## Sales ~ Ad_Spend + (Year_num * Month)/Day + COVID_Lockdown +
## Ultra_Edition + Competition + Phone_24 + Positive_News +
## Negative_News
##
## Df Sum of Sq RSS AIC
## - Year_num:Month:Day 72 1698 57829 6297.6
## - Competition 1 0 56131 6386.0
## - Ultra_Edition 1 62 56192 6388.0
## <none> 56130 6388.0
## - Negative_News 1 7195 63325 6602.7
## - Positive_News 1 8542 64672 6640.5
## - Phone_24 1 8939 65070 6651.5
## - COVID_Lockdown 1 30613 86743 7167.8
## - Ad_Spend 1 136145 192276 8597.4
##
## Step: AIC=6297.57
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition +
## Competition + Phone_24 + Positive_News + Negative_News +
## Year_num:Month
##
## Df Sum of Sq RSS AIC
## - Competition 1 0 57829 6295.6
## <none> 57829 6297.6
## - Ultra_Edition 1 70 57899 6297.7
## - Year_num:Month 11 3837 61666 6391.0
## - Negative_News 1 7533 65361 6515.5
## - Phone_24 1 8970 66798 6554.5
## - Positive_News 1 8970 66799 6554.6
## - COVID_Lockdown 1 30820 88648 7062.8
## - Ad_Spend 1 140497 198326 8509.0
##
## Step: AIC=6295.57
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Year_num:Month
##
## Df Sum of Sq RSS AIC
## <none> 57829 6295.6
## - Ultra_Edition 1 87 57916 6296.3
## - Year_num:Month 11 3858 61686 6389.6
## - Negative_News 1 7533 65362 6513.5
## - Positive_News 1 8971 66800 6552.6
## - Phone_24 1 14180 72009 6687.4
## - COVID_Lockdown 1 33773 91602 7119.7
## - Ad_Spend 1 140502 198330 8507.1
ANOVA table of the final model
anova(linear_reg_fit_final) ## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Ad_Spend 1 151427 151427 4624.336 < 2.2e-16 ***
## Year_num 1 84876 84876 2591.975 < 2.2e-16 ***
## Month 11 520587 47326 1445.266 < 2.2e-16 ***
## COVID_Lockdown 1 43105 43105 1316.368 < 2.2e-16 ***
## Ultra_Edition 1 1042 1042 31.813 1.973e-08 ***
## Phone_24 1 17964 17964 548.581 < 2.2e-16 ***
## Positive_News 1 9666 9666 295.195 < 2.2e-16 ***
## Negative_News 1 7211 7211 220.227 < 2.2e-16 ***
## Year_num:Month 11 3858 351 10.710 < 2.2e-16 ***
## Residuals 1766 57829 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hist(linear_reg_fit_final$residuals)plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final) )# calculate the mean squared error
mse_linear <-
sum(market_sale_final$Sales -
predict.glm(linear_reg_fit_final,
newdata = market_sale_final,
type = "response") ) ^2The AIC and mean square error from this multiple linear regression model are 1.1394401^{4} and 2.1707956^{-13}, respectively. This histogram of the residual is relatively normal, and the residual plot is randomly scattered around the residual of zero.
summary(linear_reg_fit_final) ##
## Call:
## lm(formula = Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown +
## Ultra_Edition + Phone_24 + Positive_News + Negative_News +
## Year_num:Month, data = market_sale_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7034 -3.9216 -0.1025 3.4740 26.5912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.996e+03 8.309e+02 8.419 < 2e-16 ***
## Ad_Spend 1.428e-03 2.181e-05 65.504 < 2e-16 ***
## Year_num -3.442e+00 4.118e-01 -8.360 < 2e-16 ***
## MonthFebruary -4.700e+03 9.452e+02 -4.972 7.27e-07 ***
## MonthMarch -3.356e+03 9.330e+02 -3.597 0.000331 ***
## MonthApril -6.489e+03 1.033e+03 -6.280 4.26e-10 ***
## MonthMay -6.430e+03 9.516e+02 -6.756 1.91e-11 ***
## MonthJune -6.999e+03 9.358e+02 -7.479 1.17e-13 ***
## MonthJuly -7.448e+03 9.346e+02 -7.970 2.83e-15 ***
## MonthAugust -7.815e+03 9.346e+02 -8.362 < 2e-16 ***
## MonthSeptember -6.771e+03 9.418e+02 -7.190 9.55e-13 ***
## MonthOctober -5.059e+03 9.340e+02 -5.416 6.91e-08 ***
## MonthNovember -6.112e+03 9.427e+02 -6.483 1.16e-10 ***
## MonthDecember -6.701e+03 1.142e+03 -5.866 5.32e-09 ***
## COVID_Lockdown1 -3.486e+01 1.085e+00 -32.115 < 2e-16 ***
## Ultra_Edition1 9.210e-01 5.638e-01 1.633 0.102545
## Phone_241 1.101e+01 5.292e-01 20.810 < 2e-16 ***
## Positive_News1 1.271e+01 7.680e-01 16.552 < 2e-16 ***
## Negative_News1 -2.188e+01 1.443e+00 -15.167 < 2e-16 ***
## Year_num:MonthFebruary 2.329e+00 4.684e-01 4.972 7.27e-07 ***
## Year_num:MonthMarch 1.663e+00 4.623e-01 3.598 0.000330 ***
## Year_num:MonthApril 3.216e+00 5.121e-01 6.280 4.25e-10 ***
## Year_num:MonthMay 3.188e+00 4.716e-01 6.760 1.87e-11 ***
## Year_num:MonthJune 3.471e+00 4.637e-01 7.485 1.12e-13 ***
## Year_num:MonthJuly 3.695e+00 4.631e-01 7.978 2.65e-15 ***
## Year_num:MonthAugust 3.880e+00 4.631e-01 8.379 < 2e-16 ***
## Year_num:MonthSeptember 3.367e+00 4.667e-01 7.215 7.97e-13 ***
## Year_num:MonthOctober 2.522e+00 4.628e-01 5.449 5.79e-08 ***
## Year_num:MonthNovember 3.051e+00 4.672e-01 6.531 8.53e-11 ***
## Year_num:MonthDecember 3.347e+00 5.662e-01 5.912 4.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.722 on 1766 degrees of freedom
## Multiple R-squared: 0.9356, Adjusted R-squared: 0.9345
## F-statistic: 884.3 on 29 and 1766 DF, p-value: < 2.2e-16
sum_stats <- summary(linear_reg_fit_final)Note that this multiple linear regression model has an adjusted R-square of 0.9345135, which means this model can explain 93.45% of the information, thus a very accurate model.
There are some interesting insights has shown from this model:
For every $1000 spent on the advertising the total daily units sold is increased by 1.43,
For every one year increase, the average total daily units sold is increase by 3.44,
On the other hand, December is the best performing month in terms of total daily units sold, 6700.65 more compared to January.
COVID Lockdown has been shown to reduce the total daily units sold by 34.86.
Positive coverage has been shown to increase the total daily units sold by 12.71,
Negative coverage has shown to reduce the total daily units sold by 21.88
Having the 24/7 call line increase the total daily units sold by 11.01
Finally, the final multiple linear regression model suggests there are no statistically significant effects on the total units sold on:
Poisson regression is often used for modelling count data. Thus, below is the modelling results using the Poisson regression analysis. I used the identity link function here, as there is no need to perform any additional transformation on the target variable from the above inspection on the distribution.
poisson_reg_fit <-
glm(
Sales ~ Ad_Spend + (Year_num * Month) / Day +
COVID_Lockdown + Ultra_Edition + Competition + Phone_24 +
Positive_News + Negative_News,
data = market_sale_final,
family = poisson(link = "identity")
)ANOVA table of the initial model
car::Anova(poisson_reg_fit, type = 3) ## Analysis of Deviance Table (Type III tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1753.23 1 < 2.2e-16 ***
## Year_num 17.16 1 3.438e-05 ***
## Month 55.86 11 5.385e-08 ***
## COVID_Lockdown 645.92 1 < 2.2e-16 ***
## Ultra_Edition 0.41 1 0.5218
## Competition 0.01 1 0.9039
## Phone_24 109.69 1 < 2.2e-16 ***
## Positive_News 102.92 1 < 2.2e-16 ***
## Negative_News 96.66 1 < 2.2e-16 ***
## Year_num:Month 56.13 11 4.808e-08 ***
## Year_num:Month:Day 23.81 72 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Choose a model by AIC in a Stepwise Algorithm
poisson_reg_fit_final <- step(poisson_reg_fit)## Start: AIC=12024.14
## Sales ~ Ad_Spend + (Year_num * Month)/Day + COVID_Lockdown +
## Ultra_Edition + Competition + Phone_24 + Positive_News +
## Negative_News
##
## Df Deviance AIC
## - Year_num:Month:Day 72 777.31 11904
## - Competition 1 753.52 12022
## - Ultra_Edition 1 753.91 12023
## <none> 753.50 12024
## - Negative_News 1 850.16 12119
## - Positive_News 1 856.43 12125
## - Phone_24 1 863.20 12132
## - COVID_Lockdown 1 1399.42 12668
## - Ad_Spend 1 2506.74 13775
##
## Step: AIC=11903.95
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition +
## Competition + Phone_24 + Positive_News + Negative_News +
## Year_num:Month
##
## Df Deviance AIC
## - Competition 1 777.32 11902
## - Ultra_Edition 1 777.79 11902
## <none> 777.31 11904
## - Year_num:Month 11 832.53 11937
## - Negative_News 1 878.02 12003
## - Positive_News 1 885.49 12010
## - Phone_24 1 887.32 12012
## - COVID_Lockdown 1 1428.28 12553
## - Ad_Spend 1 2586.82 13712
##
## Step: AIC=11901.96
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Year_num:Month
##
## Df Deviance AIC
## - Ultra_Edition 1 777.98 11901
## <none> 777.32 11902
## - Year_num:Month 11 833.35 11936
## - Negative_News 1 878.06 12001
## - Positive_News 1 885.50 12008
## - Phone_24 1 953.41 12076
## - COVID_Lockdown 1 1499.11 12622
## - Ad_Spend 1 2587.15 13710
##
## Step: AIC=11900.61
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Phone_24 +
## Positive_News + Negative_News + Year_num:Month
##
## Df Deviance AIC
## <none> 777.98 11901
## - Year_num:Month 11 833.78 11934
## - Negative_News 1 878.94 12000
## - Positive_News 1 886.18 12007
## - Phone_24 1 966.16 12087
## - COVID_Lockdown 1 1507.78 12628
## - Ad_Spend 1 2589.46 13710
ANOVA table of the final model
car::Anova(poisson_reg_fit_final, type = 3)## Analysis of Deviance Table (Type III tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1811.48 1 < 2.2e-16 ***
## Year_num 36.31 1 1.681e-09 ***
## Month 55.55 11 6.138e-08 ***
## COVID_Lockdown 729.81 1 < 2.2e-16 ***
## Phone_24 188.19 1 < 2.2e-16 ***
## Positive_News 108.20 1 < 2.2e-16 ***
## Negative_News 100.97 1 < 2.2e-16 ***
## Year_num:Month 55.80 11 5.523e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hist(poisson_reg_fit_final$residuals)plot(fitted(poisson_reg_fit_final), resid(poisson_reg_fit_final) )# calculate the mean squared error
mse_poisson <-
sum(market_sale_final$Sales -
predict.glm(poisson_reg_fit_final,
newdata = market_sale_final,
type = "response") ) ^2The AIC (1.1900615^{4}) and mean square error (1.2061961^{-17}) from the Poisson regression model are slightly better than the estimates from the linear regression above (AIC = 1.1394401^{4} and mean square error = 2.1707956^{-13}). This histogram of the residual is also relatively normal, and the residual plot is randomly scattered around the residual of zero. Thus, we will use the Poisson regression for our final prediction.
summary(poisson_reg_fit_final) ##
## Call:
## glm(formula = Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown +
## Phone_24 + Positive_News + Negative_News + Year_num:Month,
## family = poisson(link = "identity"), data = market_sale_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.07349 -0.47085 -0.00356 0.39677 2.68826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.342e+03 1.041e+03 6.090 1.13e-09 ***
## Ad_Spend 1.423e-03 3.445e-05 41.314 < 2e-16 ***
## Year_num -3.118e+00 5.159e-01 -6.044 1.51e-09 ***
## MonthFebruary -4.662e+03 1.332e+03 -3.501 0.000464 ***
## MonthMarch -3.444e+03 1.300e+03 -2.649 0.008084 **
## MonthApril -6.375e+03 1.378e+03 -4.628 3.70e-06 ***
## MonthMay -6.319e+03 1.329e+03 -4.755 1.98e-06 ***
## MonthJune -6.912e+03 1.352e+03 -5.113 3.18e-07 ***
## MonthJuly -7.512e+03 1.359e+03 -5.527 3.26e-08 ***
## MonthAugust -7.721e+03 1.397e+03 -5.527 3.26e-08 ***
## MonthSeptember -6.705e+03 1.453e+03 -4.614 3.95e-06 ***
## MonthOctober -5.087e+03 1.450e+03 -3.509 0.000450 ***
## MonthNovember -6.109e+03 1.543e+03 -3.959 7.54e-05 ***
## MonthDecember -6.935e+03 2.006e+03 -3.457 0.000546 ***
## COVID_Lockdown1 -3.469e+01 1.180e+00 -29.386 < 2e-16 ***
## Phone_241 1.102e+01 8.003e-01 13.773 < 2e-16 ***
## Positive_News1 1.232e+01 1.245e+00 9.893 < 2e-16 ***
## Negative_News1 -2.135e+01 1.909e+00 -11.184 < 2e-16 ***
## Year_num:MonthFebruary 2.310e+00 6.598e-01 3.501 0.000464 ***
## Year_num:MonthMarch 1.707e+00 6.443e-01 2.650 0.008057 **
## Year_num:MonthApril 3.160e+00 6.826e-01 4.629 3.68e-06 ***
## Year_num:MonthMay 3.133e+00 6.584e-01 4.758 1.95e-06 ***
## Year_num:MonthJune 3.428e+00 6.699e-01 5.117 3.10e-07 ***
## Year_num:MonthJuly 3.726e+00 6.735e-01 5.533 3.15e-08 ***
## Year_num:MonthAugust 3.834e+00 6.923e-01 5.538 3.05e-08 ***
## Year_num:MonthSeptember 3.334e+00 7.201e-01 4.631 3.64e-06 ***
## Year_num:MonthOctober 2.536e+00 7.183e-01 3.530 0.000415 ***
## Year_num:MonthNovember 3.049e+00 7.646e-01 3.988 6.66e-05 ***
## Year_num:MonthDecember 3.463e+00 9.943e-01 3.483 0.000495 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 11657.40 on 1795 degrees of freedom
## Residual deviance: 777.98 on 1767 degrees of freedom
## AIC: 11901
##
## Number of Fisher Scoring iterations: 4
sum_stats <- summary(poisson_reg_fit_final)There are some interesting insights has shown from this Poisson regression model:
For every $1000 spent on the advertising the total daily units sold is increased by 1.42,
For every one year increase, the average total daily units sold is increase by 3.12,
On the other hand, December is the best performing month in terms of total daily units sold, 6935.1 more compared to January.
COVID Lockdown has been shown to reduce the total daily units sold by 34.69.
Positive coverage has been shown to increase the total daily units sold by 12.32,
Negative coverage has shown to reduce the total daily units sold by 21.35
Having the 24/7 call line increase the total daily units sold by 11.02
Finally, the final multiple linear regression model suggests there are no statistically significant effects on the total units sold on:
Exacting the final predicted sales for December of 2020 with the 95% confidence intervals using the final Poisson regression model.
poisson_reg_predict <-
predict.glm(poisson_reg_fit_final,
newdata = dec_ad_final,
type = "response",
se.fit = TRUE)
dec_ad_final$Sales <- round(poisson_reg_predict$fit)
dec_ad_final$Sales_max <- with(poisson_reg_predict, fit + 1.96*se.fit) |> round()
dec_ad_final$Sales_min <- with(poisson_reg_predict, fit - 1.96*se.fit) |> round()Table of the daily total sales for December of 2020.
dec_ad_final %>%
select(Date, Sales, Sales_min, Sales_max) %>%
knitr::kable()| Date | Sales | Sales_min | Sales_max |
|---|---|---|---|
| 2020-12-01 | 120 | 115 | 124 |
| 2020-12-02 | 116 | 112 | 121 |
| 2020-12-03 | 112 | 107 | 116 |
| 2020-12-04 | 116 | 111 | 121 |
| 2020-12-05 | 135 | 130 | 140 |
| 2020-12-06 | 105 | 101 | 110 |
| 2020-12-07 | 122 | 117 | 127 |
| 2020-12-08 | 117 | 113 | 122 |
| 2020-12-09 | 128 | 123 | 132 |
| 2020-12-10 | 107 | 102 | 111 |
| 2020-12-11 | 110 | 105 | 115 |
| 2020-12-12 | 124 | 119 | 128 |
| 2020-12-13 | 135 | 130 | 140 |
| 2020-12-14 | 113 | 108 | 118 |
| 2020-12-15 | 111 | 107 | 116 |
| 2020-12-16 | 120 | 115 | 125 |
| 2020-12-17 | 116 | 112 | 121 |
| 2020-12-18 | 109 | 104 | 114 |
| 2020-12-19 | 117 | 112 | 122 |
| 2020-12-20 | 132 | 127 | 137 |
| 2020-12-21 | 109 | 104 | 114 |
| 2020-12-22 | 137 | 132 | 142 |
| 2020-12-23 | 115 | 110 | 119 |
| 2020-12-24 | 128 | 123 | 132 |
| 2020-12-25 | 105 | 100 | 110 |
| 2020-12-26 | 124 | 120 | 129 |
| 2020-12-27 | 106 | 101 | 111 |
| 2020-12-28 | 105 | 101 | 110 |
| 2020-12-29 | 146 | 141 | 151 |
| 2020-12-30 | 123 | 118 | 127 |
| 2020-12-31 | 118 | 113 | 123 |
The Poisson regression model predicts the total unit sales in December 2020 is 3681 with 95% confidence intervals of 3533 and 3828. Thus, the model suggests that the target sale of 3900 units in December 2020 is unlikely to meet.
The plot below is an interactive plot that contains both the historical (green colour-coded) and forecast (red colour-coded) sales with 95% confidence intervals. The user can hover the mouse cursor over the plot to see the actual estimates and highlight a session of the plot to zoom-in.
market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecast"
market_sale_final$Sales_max <-
market_sale_final$Sales_min <-
market_sale_final$Sales
g <-
market_sale_final %>%
bind_rows(dec_ad_final) %>%
mutate(Type = factor(Type)) %>%
ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
geom_path() +
geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()
plotly::ggplotly(g) %>% plotly::hide_legend()An alternative method is using a time-series-based
analysis with the forecast R package, which can
better handle autocorrelation, seasonality, and other temporal patterns
in the data. However, applying the method within the
forecast R package requires the dataset to be converted to
a Time-Series (ts) object. This process can be complicated,
especially if we need to include other predictors such as “Total
Advertising Spend” in the model for inference and prediction.
In addition, I have also attempted to use the eXtreme Gradient
Boosting (XGBoost) model to see if I can obtain a model with an even
smaller mean square error to ensure the total sales prediction is
accurate and robust, as this method has become a popular method on
constructing a predictive machine learning model. However, I have found
that my linear regression model gave much better mean square error
estimates for this particular dataset. Thus, I have not include it in
the final report. Please find its implementation in the
1_modelling.R file of the Rcode folder.